Here we build a basic feature matrix that can be passed to the scikit-learn clustering algorithms. Sklearn requires all the features to be stored in a 2-D array (a NumPy array, a SciPy sparse matrix, or a pandas DataFrame). The aim is a structure similar to the sample datasets shipped with sklearn. Because this is unsupervised learning, there is no target data structure, only a feature data structure, which we shall call 'Dataframe'.
In [1]:
import scipy.sparse
import numpy as np
import sklearn as skl
import pylab as plt
We read in the data from the Unique_ICSD.dat file (which should always contain the raw data left after whatever filtering we apply to the master icsd-ternaries.csv file). Dataframe will have dimensions (nsamples $\times$ nfeatures), where:
nsamples: Number of unique ternary compounds
nfeatures: columns 0:104: Number of atoms of each element, with column indices defined by the dictionary dict_element.
column 105: Space Group number
Dataframe will be a scipy csr sparse matrix as this feature space is by definition very sparse.
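The layout described above can be illustrated on a toy scale. The following is a minimal sketch, assuming a made-up universe of four placeholder elements (`A` to `D`) and two hypothetical compounds; the names `toy_elements`, `toy_compounds` and `toy_matrix` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import scipy.sparse

# Hypothetical element dictionary: 4 placeholder elements instead of 105.
toy_elements = {"A": 0, "B": 1, "C": 2, "D": 3}

# Two made-up ternary compounds: (element -> atom count, space-group number).
toy_compounds = [
    ({"A": 1, "B": 1, "C": 3}, 62),
    ({"A": 2, "C": 1, "D": 4}, 227),
]

n_el = len(toy_elements)
# One row per compound; element columns first, space group in the last column.
dense = np.zeros((len(toy_compounds), n_el + 1), dtype=float)
for row, (counts, space_group) in enumerate(toy_compounds):
    for element, amount in counts.items():
        dense[row, toy_elements[element]] = amount
    dense[row, n_el] = space_group

# CSR stores only the non-zero entries of each row.
toy_matrix = scipy.sparse.csr_matrix(dense)
print(toy_matrix.shape)  # (nsamples, nfeatures)
print(toy_matrix.nnz)    # number of stored non-zero values
```

With the full element dictionary each ternary compound fills only three element columns plus the space-group column, which is why a sparse format is a reasonable choice here.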
In [3]:
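import csv
# Read the tab-separated file and strip whitespace from every field.
with open('ICSD/Unique_ICSD.dat', 'r') as f:
    data_1 = csv.reader(f, "excel-tab")
    list_data1 = [[element.strip() for element in row] for row in data_1]
# Column 1 holds the space-group symbol: remove its internal spaces.
for row1 in list_data1:
    row1[1] = row1[1].replace(' ', '')
# Strip trailing Z, S, H and R (in that order) from each symbol.
list_space = [row1[1].rstrip('Z').rstrip('S').rstrip("H").rstrip('R') for row1 in list_data1]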
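To make the symbol clean-up in the cell above easier to follow, here is a small sketch that wraps the same two steps in a helper. The function name `normalize_symbol` and the sample strings are assumptions made for illustration; the trailing letters are presumably setting markers attached to the symbols in the raw file.

```python
def normalize_symbol(raw):
    """Remove internal spaces, then strip trailing Z, S, H, R in that order."""
    compact = raw.replace(" ", "")
    return compact.rstrip("Z").rstrip("S").rstrip("H").rstrip("R")

# Illustrative inputs, not taken from the data file.
examples = ["P 21/c", "R -3 m H", "F d -3 m S"]
for raw in examples:
    print(repr(raw), "->", repr(normalize_symbol(raw)))
```

Note that `rstrip` removes every repeated trailing occurrence of the given character, and that the calls run in a fixed order, so a symbol ending in `HS` loses both letters while one ending in `SH` loses only the `H`.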
In [4]:
# First file: tab-separated, alternating number / symbol fields.
with open("ICSD/spacegroups.dat", 'r') as f:
    dat = csv.reader(f, dialect='excel-tab', quoting=csv.QUOTE_NONE)
    list_dat = [element.strip() for row in dat for element in row]
list1 = [[int(list_dat[i*2]), list_dat[i*2+1]] for i in range(len(list_dat)//2)]
dict_space = {}
for number, symbol in list1:
    dict_space[symbol] = number
# Second file: whitespace-separated, number first, symbol second.
# Symbols already present in dict_space are left unchanged.
with open('ICSD/spacegroups_2.dat', 'r') as f1:
    for line in f1:
        data2 = [element.strip() for element in line.split()]
        if data2[1] not in dict_space:
            dict_space[data2[1]] = int(data2[0])
# Third file: whitespace-separated, symbol first, number second.
with open('ICSD/spacegroups_3.dat', 'r') as f1:
    for line in f1:
        data3 = [element.strip() for element in line.split()]
        if data3[0] not in dict_space:
            dict_space[data3[0]] = int(data3[1])
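The cell above merges three lookup files so that a symbol defined in an earlier file is never overwritten by a later one. A minimal sketch of that "first definition wins" rule, using in-memory sample lines instead of the real files, might look as follows. The helper `merge_first_wins` and the sample lines are hypothetical; the deliberately wrong value `999` only shows that the later entry is ignored.

```python
def merge_first_wins(target, pairs):
    """Add (symbol, number) pairs to target, keeping any existing entry."""
    for symbol, number in pairs:
        target.setdefault(symbol, int(number))
    return target

# Made-up sample lines mimicking the two whitespace-separated layouts.
number_first_lines = ["62 Pnma", "225 Fm-3m"]   # number, then symbol
symbol_first_lines = ["Pnma 999", "P21/c 14"]   # symbol, then number

lookup = {}
merge_first_wins(
    lookup, ((line.split()[1], line.split()[0]) for line in number_first_lines)
)
merge_first_wins(
    lookup, ((line.split()[0], line.split()[1]) for line in symbol_first_lines)
)
print(lookup)
```

`dict.setdefault` does the same job as the explicit `if key not in dict_space` check used in the notebook cell.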
In [5]:
space_num_array = np.zeros(len(list_space), dtype=float)
for i, s in enumerate(list_space):
    space_num_array[i] = dict_space[s]
In [9]:
from pymatgen.matproj.rest import MPRester
from pymatgen.core import Element, Composition
In [10]:
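element_universe = [str(e) for e in Element]
dict_element = {}
for i, j in enumerate(element_universe):
    dict_element[str(j)] = i
# Deuterium and tritium get their own columns after the regular elements.
# The hard-coded indices assume element_universe has 103 entries (0 to 102).
dict_element['D'] = 103
dict_element['T'] = 104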
In [11]:
stoich_array = np.zeros((len(list_data1), len(dict_element)), dtype=float)
for index, entry in enumerate(list_data1):
    comp = Composition(entry[2])
    temp_dict = dict(comp.get_el_amt_dict())
    for key in temp_dict.keys():
        stoich_array[index][dict_element[key]] = temp_dict[key]
In [17]:
Dataframe=scipy.sparse.csr_matrix(np.hstack((stoich_array,space_num_array[:,np.newaxis])))
In [19]:
print(Dataframe[0:3])
We save the sparse matrix using code taken from Stack Overflow (http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format):
In [20]:
def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)
In [24]:
save_sparse_csr("Dataframe",Dataframe)
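To read the matrix back in a later session, a matching loader is needed. The sketch below is one possible counterpart to `save_sparse_csr`; the name `load_sparse_csr`, the toy matrix and the temporary path are assumptions made for illustration. Because `np.savez` appends `.npz` to a filename that lacks it, the call above writes `Dataframe.npz`, and the loader has to be given that full name.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

def save_sparse_csr(filename, array):
    # Same approach as the save function above, repeated so this sketch runs alone.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # Rebuild the CSR matrix from its three component arrays and its shape.
    with np.load(filename) as loader:
        return scipy.sparse.csr_matrix(
            (loader["data"], loader["indices"], loader["indptr"]),
            shape=tuple(loader["shape"]),
        )

# Round trip on a small made-up matrix in a temporary directory.
original = scipy.sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                             [0.0, 0.0, 3.0]]))
path = os.path.join(tempfile.mkdtemp(), "toy_matrix")
save_sparse_csr(path, original)
restored = load_sparse_csr(path + ".npz")
print(np.array_equal(original.toarray(), restored.toarray()))
```

Recent SciPy versions also provide `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which may be a simpler alternative if the installed version includes them.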